Analysing Seattle's 911 calls

In [15]:
#importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
In [16]:
data=pd.ExcelFile('rev data for test.xlsx')
In [17]:
#one sheet present in the Excel file
data.sheet_names
Out[17]:
['data for test']
In [18]:
#converting dataset into DataFrame
df = data.parse('data for test')
In [19]:
#Familiarising with the data
df.head()
Out[19]:
Type Latitude Longitude Report Location
0 Beaver Accident 47.6992 -122.2167 (47.6291923608656, -122.186728398282)
1 Beaver Accident 47.6977 -122.2164 (47.5576821104334, -122.156421437319)
2 Beaver Accident 47.6967 -122.2131 (47.6167258135906, -122.173139389518)
3 Beaver Accident 47.6971 -122.2178 (47.5370517340417, -122.197755316941)
4 Beaver Accident 47.6925 -122.2127 (47.6124577512516, -122.14272010056)
In [20]:
df.describe()
Out[20]:
Latitude Longitude
count 1514.000000 1514.000000
mean 47.618480 -122.284465
std 0.051916 0.089676
min 47.500200 -122.469940
25% 47.586008 -122.355300
50% 47.608487 -122.301850
75% 47.672450 -122.185650
max 47.732000 -122.140100
In [21]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1514 entries, 0 to 1513
Data columns (total 4 columns):
Type               1514 non-null object
Latitude           1514 non-null float64
Longitude          1514 non-null float64
Report Location    1514 non-null object
dtypes: float64(2), object(2)
memory usage: 47.4+ KB
In [323]:
sns.pairplot(data=df,hue='Type')
Out[323]:
<seaborn.axisgrid.PairGrid at 0x109f3423da0>

Most common reason for calling 911

In [322]:
ax = df['Type'].value_counts().plot(kind='barh', figsize=(10,7),
                                        color="coral", fontsize=12);
ax.set_alpha(0.8)
ax.set_title("Most common reasons for calling 911 in Seattle", fontsize=18)
ax.set_xlabel("Number of calls", fontsize=18);

# create a list to collect the plt.patches data
totals = []

# find the values and append to list
for i in ax.patches:
    totals.append(i.get_width())

# compute the grand total for the percentage labels
total = sum(totals)

# set individual bar labels using the totals above
for i in ax.patches:
    # get_width pulls left or right; get_y pushes up or down
    ax.text(i.get_width()+7, i.get_y()+.38, \
            str(int(i.get_width()))+'  ('+str(round((i.get_width()/total)*100, 2))+'%)', fontsize=15,
color='dimgrey')

# invert for largest on top 
ax.invert_yaxis()
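As an aside, on matplotlib 3.4+ the patch-walking loop above can be replaced by Axes.bar_label. A minimal sketch, with hypothetical counts standing in for df['Type'].value_counts():

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs in scripts
import pandas as pd

# hypothetical counts standing in for df['Type'].value_counts()
counts = pd.Series({'Beaver Accident': 508, 'Latte Spills': 416,
                    'Marshawn Lynch Sighting': 324, 'Seal Attack': 266})

ax = counts.sort_values().plot(kind='barh', color='coral', figsize=(10, 7))
total = counts.sum()
# one "count (percent)" label per bar; ascending sort puts the largest bar on top
labels = [f"{int(v)}  ({v / total * 100:.2f}%)" for v in counts.sort_values()]
ax.bar_label(ax.containers[0], labels=labels, padding=5, color='dimgrey')
ax.set_xlabel("Number of calls")
```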

Reasons for 911 calls vs. number of calls for each reason

In [39]:
#the same counts as an interactive bar chart (uses cufflinks' .iplot, set up in the next cell)
df.groupby('Type')['Type'].count().iplot(kind='bar') 
In [23]:
import cufflinks as cf
cf.go_offline()

A scatter plot of how the 4 types of calls are distributed, using plotly (interactive)

In [325]:
data = []
color_set = ['#FE9C43','#7fc97f','#fc8d62','#66c2a5']
for i, col in enumerate(df['Type'].unique()):
    data.append(go.Scatter(x=df[df['Type'] == col]['Latitude'], y=df[df['Type'] == col]['Longitude'],
                           mode='markers', marker=dict(color=color_set[i], size=10), name=col))

layout = go.Layout(
    xaxis = dict(
        title = 'Latitude',
    ),
    yaxis = dict(
        title = 'Longitude',
    ),
)
    
fig = go.Figure(data=data, layout=layout)

iplot(fig)  
In [329]:
#Scatter plot on how the 4 types of calls are distributed using Seaborn
sns.lmplot(x='Latitude',y='Longitude', data=df,hue='Type',size=8)
Out[329]:
<seaborn.axisgrid.FacetGrid at 0x109f3b6ae10>
In [99]:
import plotly.plotly as py
import pandas as pd
from plotly import __version__
print(__version__)
import cufflinks as cf
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
import plotly.figure_factory as ff
from plotly.tools import FigureFactory as FF
init_notebook_mode(connected=True)
import plotly.graph_objs as go

cf.go_offline()
2.2.1
In [233]:
#utility function returning a distinct numeric code per category (the numbers are arbitrary and are
#used for choosing colors in the US map visualization)
def label_rows(row):
    if row['Type']=='Beaver Accident':
        return 1
    if row['Type']=='Seal Attack':
        return 8
    if row['Type']=='Latte Spills':
        return 20
    if row['Type']=='Marshawn Lynch Sighting':
        return 44

#creating a new column from the existing column for building a better model    
df['type_label'] = df.apply(label_rows, axis=1)
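The same mapping can be written without a row-wise function: Series.map with a dict does the lookup in one vectorised call. A minimal sketch on a toy frame:

```python
import pandas as pd

# toy frame with the same four 'Type' categories as the notebook
df = pd.DataFrame({'Type': ['Beaver Accident', 'Seal Attack',
                            'Latte Spills', 'Marshawn Lynch Sighting']})

# equivalent of the if-chain above, as a single dict lookup
label_map = {'Beaver Accident': 1, 'Seal Attack': 8,
             'Latte Spills': 20, 'Marshawn Lynch Sighting': 44}
df['type_label'] = df['Type'].map(label_map)
```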

Using plotly's scattergeo maps to visualize the calls on the US geography map

In [332]:
data = [ dict(
        type = 'scattergeo',
        locationmode = 'US-states',
        lon = df['Longitude'],
        lat = df['Latitude'],
        #text = df['Type'],
        mode = 'markers',
        marker = dict(
            size = 8,
            opacity = 0.8,
            reversescale = True,
            autocolorscale = False,
            symbol = 'circle',
            line = dict(
                width=0.5
            ),
            cmin = 0,
            color = df['type_label'],
            cmax = df['type_label'].max(),
        ))]

layout = dict(
    geo = dict(
        scope = 'north america',
        showland = True,
        landcolor = "rgb(212, 212, 212)",
        subunitcolor = "rgb(255, 255, 255)",
        countrycolor = "rgb(255, 255, 255)",
        showlakes = True,
        lakecolor = "blue",
        showsubunits = True,
        showcountries = True,
        showocean=True,
        resolution = 50,
        projection = dict(
            type = 'conic conformal',
            rotation = dict(
                lon = -100
            )
        ),
        lonaxis = dict(
            showgrid = True,
            gridwidth = 0.5,
            range= [ -160.0, -55.0 ],
            dtick = 5
        ),
        lataxis = dict (
            showgrid = True,
            gridwidth = 0.5,
            range= [ 35.0, 75.0 ],
            dtick = 5
        )
    ),
    title = '911 calls in Seattle',
)
fig = { 'data':data, 'layout':layout }
iplot(fig)
#This is an interactive viz; please zoom in to Seattle (the orange dot on the viz) to see all the points
#with their different colors

Dark-Blue points indicate - 'Marshawn Lynch Sighting'

Light-Blue points indicate - 'Latte Spills'

Orange points indicate - 'Seal Attack'

Red points indicate - 'Beaver Accident'

In [235]:
data = []
color_set = ['#FE9C43','#7fc97f','#fc8d62','#66c2a5']
for i, col in enumerate(df['Type'].unique()):
    data.append(go.Scatter(x=df[df['Type'] == col]['Latitude'], y=df[df['Type'] == col]['Longitude'],
                           mode='markers', marker=dict(color=color_set[i], size=10), name=col))

layout = go.Layout(
    xaxis = dict(
        title = 'Latitude',
    ),
    yaxis = dict(
        title = 'Longitude',
    ),
)
    
fig = go.Figure(data=data, layout=layout)

iplot(fig)  

From the above plot we see that some points look mislabeled, mostly the blue dots on the red cluster, i.e. Marshawn Lynch is sighted where a lot of 911 calls are recorded for Latte Spills.

Using KMeans clustering with only the Latitude and Longitude features, to see if we can predict why a resident called 911

KMeans Clustering

In [240]:
from sklearn.cluster import KMeans
kmeans=KMeans(n_clusters=4)  #number of clusters = 4 because we already know there are only 4 types of 911 calls
dat=['Latitude','Longitude']
kmeans.fit(df[dat])  #model fitting
Out[240]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
In [241]:
kmeans.cluster_centers_ #predicted cluster centers
Out[241]:
array([[  47.58520486, -122.17125506],
       [  47.59580511, -122.39803454],
       [  47.68369382, -122.32643645],
       [  47.58855643, -122.31079972]])
In [242]:
#predicted labels
kmeans.labels_  
Out[242]:
array([2, 2, 2, ..., 1, 1, 3])
In [243]:
df.describe()
Out[243]:
Latitude Longitude type_label
count 1514.000000 1514.000000 1514.000000
mean 47.618480 -122.284465 2.367239
std 0.051916 0.089676 1.154266
min 47.500200 -122.469940 1.000000
25% 47.586008 -122.355300 1.000000
50% 47.608487 -122.301850 2.000000
75% 47.672450 -122.185650 3.000000
max 47.732000 -122.140100 4.000000
In [249]:
#utility function to convert the category into numeric labels. (pd.get_dummies() would give a one-hot encoding instead.)

def cluster(row):
    if row['Type']=='Beaver Accident':
        return 0
    elif row['Type']=='Seal Attack':
        return 1
    elif row['Type']=='Latte Spills':
        return 2
    elif row['Type']=='Marshawn Lynch Sighting':
        return 3
    else:
        return 4
    
df['type_label'] = df.apply(cluster, axis=1)
In [251]:
#Metrics for evaluating the model
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(df['type_label'],kmeans.labels_))
print(classification_report(df['type_label'],kmeans.labels_))
[[499   2   6   1]
 [  3 240   5  18]
 [  0   0 416   0]
 [  0  19  47 258]]
             precision    recall  f1-score   support

          0       0.99      0.98      0.99       508
          1       0.92      0.90      0.91       266
          2       0.88      1.00      0.93       416
          3       0.93      0.80      0.86       324

avg / total       0.94      0.93      0.93      1514

From the above metrics, the KMeans model did a decent job, with few misclassifications:
number of correct classifications: 1413
number of misclassifications: 101
percentage of correct predictions: 93.3%

Most of the misclassifications (47, from the confusion matrix) are Marshawn Lynch Sightings predicted as Latte Spills.

This also supports our hypothesis that the type of call can be predicted from the available coordinates alone.
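One caveat on these metrics: comparing kmeans.labels_ to type_label directly only works because the arbitrary cluster ids happened to line up with the class codes. In general the ids must first be matched to classes, for example with the Hungarian algorithm. A minimal sketch on toy arrays, assuming scipy is available:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

# toy ground truth and cluster assignments whose ids are permuted
y_true = np.array([0, 0, 1, 1, 2, 2])
y_clu  = np.array([2, 2, 0, 0, 1, 1])   # cluster 2 corresponds to class 0, etc.

cm = confusion_matrix(y_true, y_clu)
# the Hungarian algorithm picks the cluster-to-class mapping maximising agreement
rows, cols = linear_sum_assignment(-cm)
mapping = dict(zip(cols, rows))
y_aligned = np.array([mapping[c] for c in y_clu])
accuracy = (y_aligned == y_true).mean()
```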

Should we be concerned that 'Latitude' and 'Longitude' are not necessarily Euclidean?

Yes, it concerns us a bit: treated as plain numbers, 'Latitude' and 'Longitude' lose their real-world meaning as geographic coordinates. But since our model performed well at about 93%, we can get away with using 'Latitude' and 'Longitude' as if they were planar distances. If the model didn't perform well enough, we would need to convert (Lat, Long) into another form, such as projected coordinates or great-circle distances, that better respects the geometry.
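One common conversion is the haversine formula, which gives great-circle distances from raw degrees. A minimal sketch, using the dataset's min and max coordinates from df.describe() as the two endpoints:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# diagonal of the data's bounding box: roughly 36 km, small enough that
# treating lat/lon as planar coordinates distorts distances very little
d = haversine_km(47.5002, -122.46994, 47.7320, -122.1401)
```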

In [296]:
#let's see how the number of clusters changes the error rate of the model
error_rate = []
dat=['Latitude','Longitude']
for i in range(1,10):
    kmeans=KMeans(n_clusters=i)
    kmeans.fit(df[dat])   #KMeans is unsupervised, so no labels are passed to fit
    pred_i = kmeans.predict(X_test)
    error_rate.append(np.mean(pred_i!=y_test))
    

plt.figure(figsize=(10,6))
plt.plot(range(1,10),error_rate,color='blue',linestyle='dashed',marker='o',
         markerfacecolor='red',markersize=10)
plt.title("Error Rate vs K Value - KMeans")
plt.xlabel('K')
plt.ylabel('Error Rate')
Out[296]:
<matplotlib.text.Text at 0x109f1caf4a8>

From the above plot, the error rate is lowest when the number of clusters is 4, 5, or 6. k=1 also shows a low error rate, but we can rule it out: putting every point in a single category doesn't make sense.
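A more standard way to pick the cluster count is the elbow method on KMeans' inertia_ (within-cluster sum of squares), which needs no labels at all. A minimal sketch on synthetic blobs standing in for the coordinate columns:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# four synthetic blobs near the fitted cluster centers above
centers = np.array([[47.585, -122.171], [47.596, -122.398],
                    [47.684, -122.326], [47.589, -122.311]])
X = np.vstack([c + rng.normal(scale=0.01, size=(100, 2)) for c in centers])

# inertia_ always falls as k grows; the "elbow" where it flattens suggests k
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 10)]
plt.plot(range(1, 10), inertias, 'bo--')
plt.xlabel('K')
plt.ylabel('Inertia')
```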

KNN Classification

Let's build a KNN model to see if we can improve on the previous model

In [255]:
#for KNN we need to normalize the features so that no single feature dominates the distance metric; we use StandardScaler
from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()
In [261]:
dat=['Latitude','Longitude']
scaler.fit(df[dat])
Out[261]:
StandardScaler(copy=True, with_mean=True, with_std=True)
In [262]:
scaled_features = scaler.transform(df[dat])
In [263]:
#normalized features
scaled_features
Out[263]:
array([[ 1.55533221,  0.7559141 ],
       [ 1.52642979,  0.7592606 ],
       [ 1.50716151,  0.79607208],
       ...,
       [-0.10944709, -0.98997081],
       [-0.16339827, -0.86409673],
       [-0.35608106, -0.7364769 ]])
In [264]:
df_feat = pd.DataFrame(scaled_features,columns=['Latitude','Longitude'])
In [265]:
df_feat.head()
Out[265]:
Latitude Longitude
0 1.555332 0.755914
1 1.526430 0.759261
2 1.507162 0.796072
3 1.514869 0.743644
4 1.426235 0.800534
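A quick sanity check on the scaling: after StandardScaler, each column should have mean approximately 0 and standard deviation approximately 1. A minimal sketch on synthetic coordinates:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# synthetic stand-in for the (Latitude, Longitude) columns
X = rng.normal(loc=[47.6, -122.3], scale=[0.05, 0.09], size=(1000, 2))

scaled = StandardScaler().fit_transform(X)
col_means = scaled.mean(axis=0)   # should be ~0 per column
col_stds = scaled.std(axis=0)     # should be ~1 per column
```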
In [266]:
#model evaluation 
from sklearn.cross_validation import train_test_split
C:\Anaconda\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning:

This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.

In [268]:
#let's split the data into train and test sets
X=df_feat
y=df['type_label']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3,random_state=10)
In [274]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(X_train,y_train)
Out[274]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=4, p=2,
           weights='uniform')
In [302]:
pred_k = knn.predict(X_test)
In [303]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,pred_k))
print(classification_report(y_test,pred_k))
[[147   1   0   0]
 [  2  80   3   2]
 [  0   0 122   1]
 [  0   1  14  82]]
             precision    recall  f1-score   support

          0       0.99      0.99      0.99       148
          1       0.98      0.92      0.95        87
          2       0.88      0.99      0.93       123
          3       0.96      0.85      0.90        97

avg / total       0.95      0.95      0.95       455

Hey! This model does a better job of classifying correctly than the previous one, as seen in the smaller number of misclassifications in the confusion matrix and the higher precision and recall values.

number of correct classifications: 431 (test data only)
number of misclassifications: 24
percentage of correct predictions: 94.7%

In [309]:
error_rate = []
for i in range(1,10):
    knn=KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i!=y_test))
In [311]:
plt.figure(figsize=(10,6))
plt.plot(range(1,10),error_rate,color='blue',linestyle='dashed',marker='o',
         markerfacecolor='red',markersize=10)
plt.title("Error Rate vs K Value - KNN")
plt.xlabel('K')
plt.ylabel('Error Rate')
Out[311]:
<matplotlib.text.Text at 0x109f3144f98>
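Choosing k from the error on a single test split risks tuning to that split; k-fold cross-validation averages over several splits. A minimal sketch on synthetic 4-class data standing in for the scaled features:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic 4-class stand-in for the scaled (Latitude, Longitude) features
X, y = make_blobs(n_samples=400, centers=4, cluster_std=1.0, random_state=0)

# mean 5-fold CV accuracy for each candidate k, instead of one test split
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 10)}
best_k = max(scores, key=scores.get)
```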

This model predicts the category from the labels of its k nearest neighbors. From the error-rate plot, k=3 also looks good, so let's run it again with k=3.

In [307]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,y_train)
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,pred_k))
print(classification_report(y_test,pred_k))
[[147   1   0   0]
 [  2  80   3   2]
 [  0   0 122   1]
 [  0   1  14  82]]
             precision    recall  f1-score   support

          0       0.99      0.99      0.99       148
          1       0.98      0.92      0.95        87
          2       0.88      0.99      0.93       123
          3       0.96      0.85      0.90        97

avg / total       0.95      0.95      0.95       455

Let's try Random Forests, since ensemble models often do a good job compared with single learners. First let's check how a decision tree classifier performs, and then I'll compare it with a Random Forest classifier.

Decision Tree Classification

In [333]:
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()

dtree.fit(X_train,y_train)
Out[333]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [334]:
predictions=dtree.predict(X_test)
In [335]:
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
[[147   1   0   0]
 [  2  77   0   8]
 [  0   0 119   4]
 [  0   2  13  82]]
             precision    recall  f1-score   support

          0       0.99      0.99      0.99       148
          1       0.96      0.89      0.92        87
          2       0.90      0.97      0.93       123
          3       0.87      0.85      0.86        97

avg / total       0.93      0.93      0.93       455

number of correct classifications: 425
number of misclassifications: 30
percentage of correct predictions: 93.4%

The DecisionTree classifier did not perform as well as the KMeans and KNN models.

Random Forest Classification

In [319]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=250)
rfc.fit(X_train,y_train)
rfc_pred = rfc.predict(X_test)
print(confusion_matrix(y_test,rfc_pred))
print(classification_report(y_test,rfc_pred))
[[147   1   0   0]
 [  2  75   2   8]
 [  0   0 123   0]
 [  0   1  13  83]]
             precision    recall  f1-score   support

          0       0.99      0.99      0.99       148
          1       0.97      0.86      0.91        87
          2       0.89      1.00      0.94       123
          3       0.91      0.86      0.88        97

avg / total       0.94      0.94      0.94       455

number of correct classifications: 428
number of misclassifications: 27
percentage of correct predictions: 94.1%
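With a fitted forest we can also ask which of the two coordinates drives the splits, via feature_importances_. A minimal sketch on synthetic data, since only the pattern of use matters here:

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# synthetic 4-class stand-in for the two coordinate features
X, y = make_blobs(n_samples=400, centers=4, n_features=2, random_state=0)

rfc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# impurity-based importances, one per feature, summing to 1
importances = rfc.feature_importances_
```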

Even the Random Forest classifier didn't perform much better. It is also a little hard to say which model performed best, because it depends on what you value: precision or recall. It's probably more important that 911 calls for Beaver Accidents and Seal Attacks, which are more dangerous and need immediate attention, are almost never classified as something less urgent like Marshawn Lynch Sightings or Latte Spills.

But again, it really depends upon the situation and what costs are associated with those decisions.

Most of the remaining misclassifications are Marshawn Lynch Sightings predicted as Latte Spills, which again matches what we saw earlier: Marshawn Lynch Sighting points sitting on the Latte Spills cluster.

Thank You!!